Massively parallel expressed sequence tag clustering

Authors

  • Scott J. Emrich
  • Anantharaman Kalyanaraman
  • Srinivas Aluru
Abstract

Expressed Sequence Tag (EST) sequencing is a highly efficient technique that samples expressed genes required for most cellular functions. While EST clustering is a well-studied problem and many software tools have been developed, large-scale EST clustering has previously been pursued through incremental approaches, a pipeline of programs and manual efforts to achieve a modest degree of parallelism. Here, we present the first method that can directly cluster millions of ESTs on thousands of processors. This approach requires only linear space and uses rigorous alignment-based techniques to ensure biological accuracy. Further, we minimize computationally intensive alignments with single linkage clustering and develop a method to limit the formation of large spurious clusters. The computational scalability and biological validity of this approach are demonstrated by clustering mouse EST data, one of the two largest EST collections, on a 1,024-node BlueGene/L supercomputer.
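
The abstract's point that single linkage clustering reduces the number of alignments can be illustrated with a small serial sketch. This is not the authors' implementation. It assumes a union-find structure to track clusters, enumerates all pairs (the abstract does not say how candidate pairs are generated), and uses a hypothetical `overlaps` test based on exact suffix-prefix matches as a stand-in for rigorous alignment. The names `UnionFind`, `overlaps`, `cluster_ests`, and `min_overlap` are all illustrative.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set structure; each set is one cluster of ESTs."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]
        return True


def overlaps(a, b, min_overlap):
    """Hypothetical stand-in for alignment: exact suffix-prefix match."""
    for x, y in ((a, b), (b, a)):
        for k in range(min(len(x), len(y)), min_overlap - 1, -1):
            if x[-k:] == y[:k]:
                return True
    return False


def cluster_ests(seqs, min_overlap=4):
    """Single linkage clustering; returns (clusters, comparisons made)."""
    uf = UnionFind(len(seqs))
    comparisons = 0
    for i, j in combinations(range(len(seqs)), 2):
        if uf.find(i) == uf.find(j):
            continue  # already in one cluster, so the comparison is skipped
        comparisons += 1
        if overlaps(seqs[i], seqs[j], min_overlap):
            uf.union(i, j)
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values()), comparisons


if __name__ == "__main__":
    ests = ["ACGTTTGA", "ACGTACGT", "TTGACCAA", "GGGGCCCC"]
    print(cluster_ests(ests))
```

With these four toy sequences, the pair (1, 2) is never compared, because both sequences were already joined to sequence 0. Under single linkage, one successful comparison is enough to merge two clusters, so any pair whose members already share a cluster can be skipped.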

Similar resources

Evaluating the Significance of Global and Local Features in Expressed Sequence Tag: A Clustering Quality Perspective

Clustering of expressed sequence tag (EST) plays an important role in gene analysis. Alignment-based sequence comparison is commonly used to measure the similarity between sequences, and recently some of the alignment-free comparisons have been introduced. In this paper, we evaluate the role of global and local features extracted from the alignment free approaches i.e., compression-based method...


Evaluation of Expressed Sequence Tag Clustering

Bioinformatics — the application of computer technology to the management of biological information — is essential to deciphering the genetic code of life. Novel approaches to genome sequencing, such as microarray technology, high-performance supercomputing and computational simulations in high-throughput DNA analysis have led to an explosion of genomic data available. Accurate genomic assembly...


TparvaDB: a database to support Theileria parva vaccine development

We describe the development of TparvaDB, a comprehensive resource to facilitate research towards development of an East Coast fever vaccine, by providing an integrated user-friendly database of all genome and related data currently available for Theileria parva. TparvaDB is based on the Generic Model Organism Database (GMOD) platform. It contains a complete reference genome sequence, Expressed ...


Not All Sequence Tags Are Created Equal: Designing and Validating Sequence Identification Tags Robust to Indels

Ligating adapters with unique synthetic oligonucleotide sequences (sequence tags) onto individual DNA samples before massively parallel sequencing is a popular and efficient way to obtain sequence data from many individual samples. Tag sequences should be numerous and sufficiently different to ensure sequencing, replication, and oligonucleotide synthesis errors do not cause tags to be unrecover...


Rapid evolution and selection inferred from the transcriptomes of sympatric crater lake cichlid fishes.

Crater lakes provide a natural laboratory to study speciation of cichlid fishes by ecological divergence. Up to now, there has been a dearth of transcriptomic and genomic information that would aid in understanding the molecular basis of the phenotypic differentiation between young species. We used next-generation sequencing (Roche 454 massively parallel pyrosequencing) to characterize the dive...




Publication date: 2007